Abstract

We investigate supervised learning in neural networks. We consider a multi-layered feed-forward
network trained with back-propagation. We find that a network with small-world connectivity reduces
the learning error and learning time compared to networks with regular or random connectivity.
Our study has potential applications in data mining, image processing, speech
recognition, and pattern recognition.

I. INTRODUCTION



The concept of small-world networks, and the variant of scale-free networks, has become
very popular recently [1], after the discovery that such networks are
realized in areas as diverse as the organization of human society (Milgram's experiment),
the WWW, the internet, the distribution of electrical power (western
US), and the metabolic network of the bacterium Escherichia coli.

According to Watts and Strogatz, a small-world network is characterized by a clustering
coefficient C and a path length L [1]. The clustering coefficient measures the probability
that, given that node a is connected to nodes b and c, nodes b and c are also connected.
The shortest path length from node a to node b is the minimal number of connections
needed to get from a to b. A small-world network is defined by two properties. First, the
average clustering coefficient C is larger than for a corresponding random network with the
same number of connections and nodes. The clustering coefficient expected for a regular
rectangular grid network is zero, while for a random network the probability p of connection
of two nodes is the same for neighboring nodes as for distant nodes. Second, the average
path length L scales like $\log N$, where $N$ is the number of nodes. For regular (grid) networks
L scales as $N^{1/d}$, where d is the dimension of the space, and for random networks L scales as
$\log N$. As a result, C is large and L is small in a small-world network.
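To make the two quantities concrete, the following minimal Python sketch computes an average clustering coefficient C and an average shortest path length L for an undirected graph stored as a dictionary of neighbour sets. The function names, the ring-lattice helper and the convention that nodes with fewer than two neighbours contribute zero to C are assumptions made for illustration; they are not taken from Ref. [1] or from our simulations.

```python
from collections import deque


def clustering_coefficient(adj):
    """Average over nodes of (links among neighbours) / (possible links among neighbours)."""
    total = 0.0
    for nbrs in adj.values():
        nbrs = list(nbrs)
        k = len(nbrs)
        if k < 2:
            continue  # assumed convention: such nodes contribute 0
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbrs[j] in adj[nbrs[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean breadth-first-search distance over ordered pairs of mutually reachable nodes."""
    dist_sum, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        dist_sum += sum(dist.values())
        pairs += len(dist) - 1
    return dist_sum / pairs


def ring_lattice(n, k):
    """Regular ring: each node is linked to its k nearest neighbours on each side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k + 1):
            adj[i].add((i + step) % n)
            adj[(i + step) % n].add(i)
    return adj


ring = ring_lattice(10, 2)
print(clustering_coefficient(ring), average_path_length(ring))
```

Comparing these two numbers for a regular graph, a partially rewired graph and a fully random graph with the same number of nodes and links is one way to locate the small-world regime.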
FIG. 1: Scheme of network connection topologies obtained by randomly cutting and rewiring
connections, starting from a regular net (left) and going to a random net (right). The small-world
network is located somewhere in between.
In studies of small-world neural networks it has turned out that this architecture has
potentially many advantages. In a Hodgkin-Huxley network the small-world architecture
has been found to give a fast and synchronized response to stimuli in the brain [12]. In
associative memory models it was observed that the small-world architecture yields the
same memory retrieval performance as randomly connected networks, using only a fraction
of the total connection length. Likewise, in the Hopfield associative memory model the
small-world architecture turned out to be optimal in terms of memory storage abilities.
With an integrate-and-fire type of neuron it has been shown that short-cuts permit
self-sustained activity. A model of neurons connected in small-world topology has been used to
explain short bursts and seizure-like electrical activity in epilepsy. In the biological context,
some experimental works seem to have found scale-free and small-world topology in living
animals, for example in the macaque visual cortex, in the cat cortex, and even in networks of
correlated human brain activity measured via functional magnetic resonance imaging [18].
The network of cortical neurons in the brain has sparse long-ranged connectivity, which
may offer some advantages of small-world connectivity.

II. MODEL AND GEOMETRY OF NETWORK 

In the present work we study supervised learning with back-propagation in a multi-layered
feed-forward network. Around 1960, the Perceptron model [20] was widely
investigated. The information (representing action potentials propagating in axons and
dendrites of biological neurons) feeds in the forward direction. The Perceptron model has
been extended to include several layers. E.g., a convolutional network, consisting of seven
layers plus one input layer, has been used to read cheques. The task of learning
consists of finding optimal weights $w_{ij}$ between neurons (representing synaptic connections)
such that for a given set of input and output patterns (training patterns) the network
generates output patterns as close as possible to the target patterns. We used the algorithm
of back-propagation to determine those weights. There are alternative, potentially
faster methods to determine those weights, e.g. simulated annealing. Here we aim
to compare different network architectures with respect to learning, using as reference a
standard algorithm to determine those weights.

FIG. 2: $D_{local}$ and $D_{global}$ versus number of rewirings. Network with 5 neurons per layer and 5
layers.

We change the architecture of neural connections from regular to random architecture,
while keeping the number $K_{conn}$ of connections fixed. Initially, neurons are connected
feed-forward, i.e. each neuron of a given layer connects to all neurons of the subsequent layer.
We make a random draw of two nodes which are connected to each other. We cut that
"old" link. In order to create a "new" link, we make another random draw of two nodes. If
those nodes are already connected to each other, we make further draws until we find two
unconnected nodes. Then we create a "new" link between those nodes. Thus we conserve
the total number $K_{conn}$ of links. The number of links in a regular network is given by
$K_{conn} = N_{neuron}^2 (N_{layer} - 1)$, where $N_{neuron}$ denotes the number of neurons per layer
and $N_{layer}$ denotes the number of layers. In this way we create some connections between
nodes in distant layers, i.e. short-cuts, and the topology changes gradually (see Fig. 1). In
particular, while the initial connectivity is regular, after many rewiring steps the topology
becomes random. Somewhere in between lies the small-world architecture [1]. We consider
networks consisting of one input layer, one output layer and one or more hidden layers.
While the standard Perceptron model has only two layers, here we explore the use of more
hidden layers, because only then can the path length L become small and the network
"small-world".

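The cut-and-rewire procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for our simulations: nodes are labelled by hypothetical `(layer, index)` tuples, new links are only allowed between different layers and are oriented from the earlier to the later layer, and a new link may not simply restore the link that was just cut. The function names are likewise chosen for illustration.

```python
import random


def regular_links(n_neuron, n_layer):
    """Regular net: every neuron feeds every neuron of the subsequent layer."""
    links = set()
    for layer in range(n_layer - 1):
        for a in range(n_neuron):
            for b in range(n_neuron):
                links.add(((layer, a), (layer + 1, b)))
    return links


def rewire(links, n_neuron, n_layer, n_rewire, rng):
    """Cut one random existing link and create one new link, n_rewire times."""
    links = set(links)
    nodes = [(layer, i) for layer in range(n_layer) for i in range(n_neuron)]
    for _ in range(n_rewire):
        old = rng.choice(sorted(links))  # sorted() keeps the draw reproducible
        links.remove(old)
        while True:
            u, v = rng.sample(nodes, 2)
            if u[0] == v[0]:
                continue  # assumption: no links inside a layer
            new = (u, v) if u[0] < v[0] else (v, u)  # orient earlier -> later layer
            if new != old and new not in links:
                links.add(new)
                break
    return links


regular = regular_links(5, 5)
rewired = rewire(regular, 5, 5, 20, random.Random(0))
# The number of links is conserved: K_conn = N_neuron^2 * (N_layer - 1)
print(len(regular), len(rewired))
```

Orienting every link from the earlier to the later layer keeps the rewired network feed-forward, so that back-propagation can still be applied layer by layer.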
Instead of measuring the network architecture by the functions C and L, we have used
the functions of local and global connectivity length $D_{local}$ and $D_{global}$, respectively,
introduced by Marchiori et al. [23, 24]. Those functions are more suitable because they
allow one to take into account the connectivity matrix and the weight matrix in networks and
also to treat networks with some unconnected neurons (often nodes are unconnected due
to rewiring). $D_{local}$ and $D_{global}$ are defined via the concept of global and local efficiency
[23, 24]. For a graph G its efficiency is defined by

$$E_{global}(G) = \frac{1}{N(N-1)} \sum_{i \neq j \in G} \frac{1}{d_{ij}} , \qquad (1)$$

where $d_{ij}$ denotes the shortest path length between vertices i and j and N is the total number
of vertices. Moreover, one defines the local efficiency as the average efficiency of subgraphs
$G_i$ (the subgraph $G_i$ is usually formed by the neurons directly connected to neuron i; we
have also included all neurons occurring in the same layer as neuron i),

$$E_{local} = \frac{1}{N} \sum_{i \in G} E_{global}(G_i) . \qquad (2)$$

The connectivity length is defined by the inverse of efficiency,

$$D_{global} = \frac{1}{E_{global}} , \qquad D_{local} = \frac{1}{E_{local}} . \qquad (3)$$

TABLE I: Network parameters. $N_{rewire}$ and $D_{global}$ correspond to the position where both
$D_{local}$ and $D_{global}$ are small. Scaling of R (Eq. (5)) and of $D_{global}$ (Eq. (4)).

FIG. 3: Network of 5 neurons per layer and different number of layers. Learning of 20 patterns,
learning rate 0.02 (back-propagation), training by 10 statistical tests.

It has been shown that $D_{global}$ is similar to L, and $D_{local}$ is similar to 1/C [24], although this
is not an exact relation. Thus, if a network is small-world, both $D_{local}$ and $D_{global}$ should
become small. Fig. 2 shows $D_{local}$ and $D_{global}$ as a function of the number of rewiring steps
for a network of 5 neurons per layer and 5 layers. One observes that both $D_{local}$ and $D_{global}$
become small at about $N_{rewire} \approx 20$. The positions of the minima for other networks are shown
in Tab. I.
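The efficiency-based measures can be evaluated with a breadth-first search, as in the following hedged Python sketch. It assumes an unweighted, undirected graph stored as a dictionary of neighbour sets and uses the common definition of the subgraph $G_i$ (the direct neighbours of node i only); the extension to neurons of the same layer used in our work, and the use of the weight matrix, are not reproduced here. All function names are illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def efficiency(adj):
    """Average of 1/d_ij over ordered pairs; unreachable pairs contribute 0."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for i in adj:
        dist = bfs_distances(adj, i)
        total += sum(1.0 / d for j, d in dist.items() if j != i)
    return total / (n * (n - 1))


def local_efficiency(adj):
    """Average efficiency of the subgraphs induced by each node's neighbours."""
    total = 0.0
    for nbrs in adj.values():
        sub = {u: adj[u] & nbrs for u in nbrs}
        total += efficiency(sub)
    return total / len(adj)


def connectivity_length(eff):
    """Inverse of the efficiency; infinite when the efficiency vanishes."""
    return float("inf") if eff == 0 else 1.0 / eff


path = {0: {1}, 1: {0, 2}, 2: {1}}
print(connectivity_length(efficiency(path)), connectivity_length(local_efficiency(path)))
```

Because unreachable pairs simply add zero to the sum, the measure remains well defined when rewiring leaves some neurons unconnected, which is the property that motivated its use here.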

Small-world networks are characterized by the scaling law $L \propto \log(N_{nodes})$. From the
similarity of $D_{global}$ and L, we expect

$$D_{global} \propto \log(N_{nodes}) \quad (\text{for } N_{nodes} \to \infty) . \qquad (4)$$

In Tab. I we display $S = D_{global} / \log(N_{nodes})$, where $D_{global}$ is evaluated at
the position where both $D_{local}$ and $D_{global}$ are small. The onset of scaling is observed when
networks become larger. From our data we observed another scaling law for the variable
$R = N_{rewire} / K_{conn}$, with $N_{rewire}$ taken at the same position. Our data are consistent with

$$R = R_0 + \Sigma \log(K_{conn}) \quad (\text{for } N_{nodes} \to \infty) , \qquad (5)$$

where $R_0$ and $\Sigma$ are constants. In Tab. I we display $\Sigma$. This law says that the number of
rewirings associated with the minimum of $D_{local}$ and $D_{global}$, i.e. small-world connectivity,
measured in terms of the number of connections of the regular network, increases with the
number of connections in a logarithmic way.
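A logarithmic law of the form of Eq. (5) can be checked by an ordinary least-squares fit of R against the logarithm of the number of connections. The sketch below is only an illustration of such a fit: the function name is hypothetical, the three values of the number of connections follow from the network sizes considered in this work (5 x 5, 5 x 8 and 15 x 8 neurons), and the R values are synthetic numbers generated from an assumed intercept and slope, not the data of Tab. I.

```python
import math


def fit_log_law(k_conn, r):
    """Least-squares fit of r = r0 + slope * log(k_conn); returns (r0, slope)."""
    x = [math.log(k) for k in k_conn]
    n = len(x)
    mean_x = sum(x) / n
    mean_r = sum(r) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xr = sum((xi - mean_x) * (ri - mean_r) for xi, ri in zip(x, r))
    slope = s_xr / s_xx
    return mean_r - slope * mean_x, slope


# Link counts of the regular nets: 5^2 * 4, 5^2 * 7 and 15^2 * 7.
k_values = [100, 175, 1575]
# Synthetic R values (assumed r0 = 0.1, slope = 0.05), used only to exercise the fit.
r_values = [0.1 + 0.05 * math.log(k) for k in k_values]
print(fit_log_law(k_values, r_values))
```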



III. LEARNING 



We trained the networks with random binary input and output patterns. Our neurons
are sigmoid real-valued units. The learning time, defined as the number of iterations it
takes until the training error becomes smaller than a given error tolerance, depends on
the error tolerance. In the following we will present the learning error instead of the learning
time. The effect on learning due to a variation of the number of layers and the effect of
rewiring, i.e. introducing short-cuts, is shown in Fig. 3. It shows the absolute error, which
is an average of the absolute error over the neurons, patterns and statistical tests. We have
considered a network of 5 neurons per layer and varied the number of layers from 5 to 8.
We compare the regular network ($N_{rewire} = 0$) with the network of $N_{rewire} = 40$. We find
that for up to a few thousand iterations the network with $N_{rewire} = 40$ gives a smaller
error compared to the regular network. In the regime of many iterations, one observes
that the 5-layer regular network learns (i.e. the error decreases with iterations), while the
regular 8-layer network does not learn at all. In contrast, after 40 rewirings all networks
do learn. The improvement over the regular network is small for 5 layers, but major for
8 layers. This indicates that the learning improvement due to small-world architecture
becomes more pronounced in the presence of more layers. In the case of 8 layers, the error
curve for $N_{rewire} = 40$ is in the regime where $D_{local}$ and $D_{global}$ are both small (see Tab. I),
i.e. close to the small-world architecture.

FIG. 4: Network of 5 neurons per layer and 5 layers. Learning of 5 patterns (a), 20 patterns (b)
and 80 patterns (c).

FIG. 5: Network of 15 neurons per layer with 8 layers. Network was trained with 40 patterns by
17 statistical tests.



[Plot of Fig. 6: error axis from 0.00 to 0.08 versus iterations (up to 2000000); pattern labels 4p to 20p; legend "Rewiring": 20, 40, 60, 80, 100.]
FIG. 6: Network of 5 neurons per layer and 5 layers. Learning by adding patterns sequentially in 
groups. Network was trained by 20 statistical tests. 

The effect on learning when changing the number of learning patterns and the rewiring
of the network is shown in Fig. 4 for a network of 5 neurons per layer and 5 layers. When
learning 5 patterns (Fig. 4(a)), in the domain of few iterations (1000-5000), rewiring brings
about a substantial reduction in the error compared to the regular network ($N_{rewire} = 0$)
and also to the random network ($N_{rewire} > 100$). For very many iterations (about 500000)
there is little improvement with rewirings compared to the regular architecture. For learning
of 20 patterns (Fig. 4(b)) the behavior is similar, but the reduction of error in the presence
of 20 rewirings is more substantial for many iterations. When learning 80 patterns (Fig. 4(c)),
we see that the error is large; we are near or beyond the storage capacity. In this case the
regular architecture is optimal and rewiring brings no advantage.
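For readers who wish to experiment with the training setup of this section, the following Python sketch implements online back-propagation on a layered network whose links may skip layers, and records the mean absolute output error per iteration. It is a minimal illustration under several assumptions that are not specified in the text: links are given as hypothetical `((layer, index), (layer, index))` pairs oriented from an earlier to a later layer, there are no bias terms, initial weights are drawn uniformly from [-1, 1], weights are updated after each pattern, and one iteration means one pass over all patterns. Only the sigmoid units, the binary patterns and the learning rate 0.02 are taken from the text.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(links, n_neuron, n_layer, inputs, targets, lr=0.02, n_iter=1000, seed=0):
    """Return the mean absolute output error measured during each iteration."""
    rng = random.Random(seed)
    w = {link: rng.uniform(-1.0, 1.0) for link in links}  # assumed initial range
    incoming, outgoing = {}, {}
    for u, v in w:
        incoming.setdefault(v, []).append(u)
        outgoing.setdefault(u, []).append(v)
    last = n_layer - 1
    errors = []
    for _ in range(n_iter):
        abs_err = 0.0
        for x, t in zip(inputs, targets):
            # Forward pass, layer by layer; short-cuts are handled via `incoming`.
            act = {(0, i): x[i] for i in range(n_neuron)}
            for layer in range(1, n_layer):
                for i in range(n_neuron):
                    v = (layer, i)
                    net = sum(w[(u, v)] * act[u] for u in incoming.get(v, []))
                    act[v] = sigmoid(net)
            # Backward pass: deltas for the output layer, then for hidden layers.
            delta = {}
            for i in range(n_neuron):
                y = act[(last, i)]
                delta[(last, i)] = (y - t[i]) * y * (1.0 - y)
                abs_err += abs(y - t[i])
            for layer in range(last - 1, 0, -1):
                for i in range(n_neuron):
                    u = (layer, i)
                    back = sum(w[(u, v)] * delta[v] for v in outgoing.get(u, []))
                    delta[u] = back * act[u] * (1.0 - act[u])
            # Gradient step on every existing link.
            for u, v in w:
                w[(u, v)] -= lr * delta[v] * act[u]
        errors.append(abs_err / (len(inputs) * n_neuron))
    return errors


# Regular 5 x 5 network trained on a few random binary patterns.
rng = random.Random(1)
links = [((l, a), (l + 1, b)) for l in range(4) for a in range(5) for b in range(5)]
patterns = [[rng.randint(0, 1) for _ in range(5)] for _ in range(4)]
goals = [[rng.randint(0, 1) for _ in range(5)] for _ in range(4)]
print(train(links, 5, 5, patterns, goals, n_iter=20)[-1])
```

Averaging the returned error curves over several seeds would correspond to the average over statistical tests mentioned above.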
FIG. 7: Generalization. Network of 5 neurons per layer and 8 layers. Full symbols correspond to
trained patterns, empty symbols correspond to untrained patterns. 

The influence of the number of neurons per layer on learning is depicted in Fig. 5.
We compare (a) a network of 5 neurons per layer and 8 layers with (b) a network of 15
neurons per layer and 8 layers. We trained the network with 40 patterns in both cases.
In case (a) (not shown) we found that the error as a function of rewiring has a minimum at
$N_{rewire} \approx 30$, which coincides with the minimum of $D_{local}$ and $D_{global}$ given in Tab. I,
i.e. in the small-world regime. This gives a clear improvement over the regular architecture
($N_{rewire} = 0$) and the random architecture ($N_{rewire} > 100$). In case (b) the learning error
is shown in Fig. 5. One observes that the learning error has a minimum at $N_{rewire} \approx 750$,
which is close to the minimum of $D_{local}$ and $D_{global}$ at $N_{rewire} \approx 830$ (see Tab. I), also near
the small-world regime.

So far we studied learning when the training patterns were given all together in the
beginning. However, sometimes one is interested in changing strategies or parameters
during training (enforced learning). In order to see if the small-world architecture gives
an advantage in such scenarios, we studied learning by adding patterns sequentially in
groups of 2 patterns. For a network of 5 neurons per layer and 5 layers, Fig. 6 shows
the learning error versus the number of rewirings. Also here one observes a distinctive
gain by introducing some rewirings (optimum at $N_{rewire} \approx 40$) over the regular network
($N_{rewire} = 0$) and the random network ($N_{rewire} > 100$).
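Adding patterns sequentially in groups can be expressed as a schedule of growing training sets. The short Python sketch below assumes that patterns already presented remain in the training set when a new group of 2 is added; the text does not state this explicitly, so it should be read as one possible interpretation. The function name is illustrative.

```python
def pattern_schedule(patterns, group_size=2):
    """Yield growing training sets: first group, first two groups, and so on."""
    for end in range(group_size, len(patterns) + group_size, group_size):
        yield patterns[:min(end, len(patterns))]


# Each stage would be trained for some number of iterations before the next group is added.
for stage in pattern_schedule(["p1", "p2", "p3", "p4", "p5"]):
    print(len(stage), stage)
```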

In the last experiment we studied generalization in a network of 5 neurons per layer and
8 layers. We trained the network to put patterns into classes. We considered 5 classes:
a pattern belongs to class 1 if neuron 1 in the input layer has the highest value of all
neurons in this layer. Class 2 corresponds to neuron 2 having the highest value, etc. The
classification of 200 patterns achieved by the network as a function of connectivity is shown
in Fig. 7. It turns out that the network with some rewirings gives an improvement over the
regular and also the random network architecture. We observe that the generalization error has a
minimum at $N_{rewire} \approx 40$, which is in the regime where $D_{local}$ and $D_{global}$ are small (see
Tab. I), i.e. in the regime of small-world architecture.
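The classification rule used for the generalization test is easy to state in code. In the following Python sketch the class of a pattern is the (1-based) index of the input neuron with the highest value. The choice of real-valued inputs drawn uniformly from [0, 1), the one-hot target encoding and the tie-breaking towards the lowest index are assumptions made for illustration and are not specified in the text.

```python
import random


def class_of(pattern):
    """Class k (1-based) if input neuron k holds the highest value."""
    return max(range(len(pattern)), key=lambda i: pattern[i]) + 1


def one_hot(k, n_classes):
    """Assumed target encoding: output neuron k is 1, all others are 0."""
    return [1.0 if i == k - 1 else 0.0 for i in range(n_classes)]


def make_dataset(n_patterns, n_neuron=5, seed=0):
    """Random input patterns together with their class labels."""
    rng = random.Random(seed)
    inputs = [[rng.random() for _ in range(n_neuron)] for _ in range(n_patterns)]
    labels = [class_of(x) for x in inputs]
    return inputs, labels


inputs, labels = make_dataset(200)
targets = [one_hot(k, 5) for k in labels]
print(labels[:10])
```

Splitting such a data set into trained and untrained patterns gives the two kinds of symbols shown in Fig. 7.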

In summary, we observed that the network with some rewirings, i.e. short-cuts, gives
reduced learning errors. The minimum of the error as a function of rewiring lies in the
neighborhood of the minimum of $D_{local}$ and $D_{global}$, which indicates small-world architecture.
However, while $D_{local}$ and $D_{global}$ depend only on the connectivity of the network, learning
error and learning time depend on other parameters like the number of learning patterns,
or the mode of learning, e.g. standard learning or enforced learning. We believe that our
results have important implications for artificial intelligence and artificial neural networks.
Neural networks are widely used as computational devices for optimization problems
based on learning, in applications like pattern recognition, image processing, speech
recognition, error detection, and quality control in industrial production (e.g. of car parts) or sales
of airline tickets. Our study reveals a clear advantage of small-world networks and suggests
using them in such applications. Our study may also have potential applications in the domain
of data mining.